ECI Benchmark Forecast Backtesting

Compare forecast benchmark scores against actual state-of-the-art (SOTA) performance across trend methods and forecast horizons.

Methodology

For each historical backtest date, we:

  • Filter data to models released before that date
  • Fit an ECI (Epoch Capabilities Index) model using item response theory (IRT)
  • Extract trend data for top-N models by ECI at release
  • Fit a trend using either BIC model selection or Last-M-Months linear regression
  • Forecast ECI forward and convert it to benchmark scores via sigmoid(slope × (ECI − EDI)), where slope and EDI are per-benchmark parameters
  • Compare against actual SOTA at the target date
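
The conversion step in the list above can be sketched as follows. This is a minimal illustration, not the project's code: the function name and the numeric values are hypothetical, and only the formula sigmoid(slope × (ECI − EDI)) comes from the text.

```python
import math

def eci_to_score(eci: float, edi: float, slope: float) -> float:
    """Map an ECI value to an expected benchmark score in (0, 1).

    Implements sigmoid(slope * (eci - edi)); `edi` and `slope` are the
    per-benchmark parameters named in the methodology.
    """
    return 1.0 / (1.0 + math.exp(-slope * (eci - edi)))

# Hypothetical values, for illustration only.
forecast_eci = 150.0
benchmark_edi = 145.0
benchmark_slope = 0.1

score = eci_to_score(forecast_eci, benchmark_edi, benchmark_slope)
print(f"forecast score: {score:.3f}")
```

When the forecast ECI equals the benchmark's EDI the score is exactly 0.5. With a positive slope, the score rises monotonically as ECI increases.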

Trend Methods

  • BIC: Compares a linear model against a piecewise-linear model (single breakpoint) using the Bayesian Information Criterion. If the piecewise model is preferred, the forecast uses the slope after the breakpoint.
  • Last M Months: Simple linear regression on only the last M months of data, where M equals the forecast horizon.
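
Both trend methods can be sketched in a few lines of standard-library Python. Several choices here are assumptions for illustration rather than the actual implementation: the piecewise model is fitted as two independent line segments (a continuous hinge model is another common choice), the parameter counts are k = 2 for linear and k = 5 for piecewise (four coefficients plus the breakpoint), and time is assumed to be measured in months and sorted ascending. All function names are hypothetical.

```python
import math

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b, rss)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    a = my - b * mx
    rss = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    return a, b, rss

def bic(rss, n, k):
    """Gaussian BIC up to an additive constant: n*ln(RSS/n) + k*ln(n)."""
    # Clamp RSS so that a perfect fit does not produce log(0).
    return n * math.log(max(rss, 1e-12) / n) + k * math.log(n)

def select_trend_slope(xs, ys, min_seg=3):
    """Return (model_name, slope) for the model with the lower BIC."""
    n = len(xs)
    _, b_lin, rss_lin = fit_line(xs, ys)
    best = ("linear", b_lin, bic(rss_lin, n, k=2))
    # Try every breakpoint that leaves at least min_seg points per segment.
    for i in range(min_seg, n - min_seg + 1):
        _, _, rss_left = fit_line(xs[:i], ys[:i])
        _, b_right, rss_right = fit_line(xs[i:], ys[i:])
        score = bic(rss_left + rss_right, n, k=5)
        if score < best[2]:
            # Piecewise preferred: keep the slope after the breakpoint.
            best = ("piecewise", b_right, score)
    return best[0], best[1]

def last_m_months_slope(xs, ys, m):
    """Slope of a linear fit restricted to the last m months of data."""
    cutoff = max(xs) - m
    recent = [(x, y) for x, y in zip(xs, ys) if x >= cutoff]
    rx, ry = zip(*recent)
    return fit_line(rx, ry)[1]

# Hypothetical data: slope 1 for the first 6 months, slope 3 afterwards.
months = list(range(12))
eci = [float(x) if x < 6 else 6.0 + 3.0 * (x - 6) for x in months]
print(select_trend_slope(months, eci))
print(last_m_months_slope(months, eci, m=5))
```

The BIC penalty term k·ln(n) is what keeps the piecewise model from winning on every dataset: it must reduce the residual sum of squares by enough to pay for its extra parameters.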

Confidence Intervals

CIs are computed using prediction interval formulas that account for:

  • Slope uncertainty from the linear regression
  • Residual variance around the trend
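
As a rough illustration of how these two sources of uncertainty combine, the sketch below implements the textbook prediction interval for simple linear regression. It is an assumption that this matches the formulas used here. It also substitutes a normal quantile for the Student-t quantile to stay within the standard library, which makes intervals slightly too narrow for small samples. The function name and data are hypothetical.

```python
import math
from statistics import NormalDist

def prediction_interval(xs, ys, x0, level=0.90):
    """Return (lower, point, upper) for a new observation at x0."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    a = my - b * mx
    rss = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    s2 = rss / (n - 2)  # residual variance around the trend
    # 1          -> residual variance of the new observation
    # 1/n        -> uncertainty in the fitted mean level
    # (x0-mx)^2/sxx -> slope uncertainty, growing with distance from the data
    se = math.sqrt(s2 * (1 + 1 / n + (x0 - mx) ** 2 / sxx))
    z = NormalDist().inv_cdf(0.5 + level / 2)  # normal approximation
    point = a + b * x0
    return point - z * se, point, point + z * se

# Hypothetical trend data (x in months, y in ECI units).
xs = [0, 1, 2, 3, 4, 5]
ys = [0.1, 0.9, 2.1, 2.9, 4.1, 4.9]
print(prediction_interval(xs, ys, x0=8))
print(prediction_interval(xs, ys, x0=12))
```

The slope-uncertainty term grows with the squared distance between the forecast point and the mean of the fitted data, so intervals widen as the horizon lengthens.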

Calibration Note: The current CIs are too narrow (overconfident): approximately 53% of actuals fall within the nominal 90% CIs, versus the expected 90%. This is because the intervals do not account for uncertainty in the benchmark parameters (EDI, slope) or in model selection. Future work could use bootstrap sampling to produce better-calibrated intervals.
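
A coverage figure like the one quoted above can be computed by counting how often the actual value lands inside its forecast interval. The sketch below is a minimal illustration with made-up numbers; the function name is hypothetical.

```python
def empirical_coverage(actuals, lowers, uppers):
    """Fraction of actual values that fall inside [lower, upper]."""
    hits = sum(lo <= a <= hi for a, lo, hi in zip(actuals, lowers, uppers))
    return hits / len(actuals)

# Made-up backtest results, for illustration only.
actuals = [0.5, 0.7, 0.9, 0.3]
lowers = [0.4, 0.75, 0.8, 0.1]
uppers = [0.6, 0.9, 0.95, 0.2]

coverage = empirical_coverage(actuals, lowers, uppers)
print(f"empirical coverage: {coverage:.0%}")
```

For well-calibrated 90% intervals this fraction should be close to 0.90 over a large number of backtests. A value well below that indicates intervals that are too narrow.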

Method Comparison (All Benchmarks)

  • Mean Absolute Error
  • Bias (Forecast − Actual)

Explore by Configuration

  • Metrics
  • Forecast vs Actual Over Time
  • Error Distribution